This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
To give an example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
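The distinction between completing an offer and being influenced by it can be expressed directly: a completion only counts as influenced if a view event falls between receipt and completion, all inside the validity window. A minimal sketch with hypothetical timestamps (in hours, mirroring the transcript data):

```python
def influenced_completion(received_at, viewed_at, completed_at, valid_hours):
    """A completion is 'influenced' only if the offer was viewed after
    receipt and before completion, all inside the validity window."""
    if viewed_at is None:
        # the customer never opened the offer, so it could not have influenced them
        return False
    window_end = received_at + valid_hours
    return received_at <= viewed_at <= completed_at <= window_end

# "buy 10 dollars get 2 dollars off", 10-day validity, completed on day 4
valid_hours = 10 * 24
print(influenced_completion(0, None, 96, valid_hours))  # False: completed but never viewed
print(influenced_completion(0, 24, 96, valid_hours))    # True: viewed on day 1, completed on day 4
```

This is exactly the kind of record the cleaning steps later in the notebook have to separate out: a completion row in the transcript with no preceding view should not count as a success.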
This makes data cleaning especially important and tricky.
You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
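The heuristic route in the last sentence amounts to a group-by over demographics and offers, comparing observed response rates. A toy sketch (the column names and values here are hypothetical, not the actual schema):

```python
import pandas as pd

# Toy responses: 1 = responded to the offer, 0 = did not
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M"],
    "offer":  ["A", "A", "B", "B", "A", "B"],
    "responded": [1, 1, 1, 0, 0, 1],
})

# Response rate per (demographic group, offer) cell
rates = df.groupby(["gender", "offer"])["responded"].mean()
print(rates)

# For each group, the offer with the highest observed response rate
best = rates.unstack("offer").idxmax(axis=1)
print(best)  # F -> A, M -> B in this toy example
```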
The data is contained in three files, whose variables are used throughout the notebook:

- portfolio.json: offer metadata (id, offer type, difficulty, reward, duration in days, channels)
- profile.json: customer demographics (id, gender, age, income, date the customer became a member)
- transcript.json: event records (person, event type, value dictionary, time in hours since the start of the test)
import sys
!"{sys.executable}" -m pip install https://github.com/pandas-profiling/pandas-profiling/archive/refs/tags/v3.0.0.zip
!jupyter nbextension enable --py widgetsnbextension
!"{sys.executable}" -m pip install panel
!pip install sagemaker==1.72.0
%load_ext autoreload
%autoreload 2
# Our packages
import json
import math
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer, MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import os
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 150)
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
portfolio
print (f"portfolio: -> {portfolio.shape[0]} rows \n {' '*8} -> {portfolio.shape[1]} columns")
# creating a more representative column for the channels column, only for the eda part
portfolio['channels_eda'] = portfolio['channels'].apply(lambda x: ' '.join(x))
portfolio
display(portfolio.describe(include='all'))
print ()
display(portfolio.info())
Note: the interactive profiling report below may not render inside the notebook. If it does not, open `portfolio_report.html` (or the HTML export of this notebook) instead.
portfolio_report = ProfileReport(
portfolio.loc[:, portfolio.columns != 'id']
, title="Portfolio Exploratory Data Analysis Report"
, html={"style": {"full_width": True}}
, explorative=True
, sort=None
)
portfolio_report.to_file("portfolio_report.html")
portfolio_report
# remove channels_eda column which has been created only for the eda part
portfolio = portfolio.drop(['channels_eda'], axis=1)
# read in the json files
profile = pd.read_json('data/profile.json', orient='records', lines=True)
profile
print (f"profile: -> {profile.shape[0]} rows \n {' '*6} -> {profile.shape[1]} columns")
profile.dtypes
# change became_member_on to date format
profile["became_member_on"] = profile["became_member_on"].apply(lambda x: pd.to_datetime(str(x)))
display(profile.dtypes)
print()
profile
display(profile.describe(include='all'))
print ()
display(profile.info())
Note: the interactive profiling report below may not render inside the notebook. If it does not, open `profile_report.html` (or the HTML export of this notebook) instead.
profile_report = ProfileReport(profile.loc[:, profile.columns != 'id'], title="Profile Exploratory Data Analysis Report", explorative=True)
profile_report.to_file("profile_report.html")
profile_report
# read in the json files
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
transcript
print (f"transcript: -> {transcript.shape[0]} rows \n {' '*9} -> {transcript.shape[1]} columns")
# for EDA only: serialize the value dict, replacing raw offer ids with their offer type
transcript['value_eda'] = transcript.value.apply(
    lambda x: json.dumps(x).replace(x['offer id'], portfolio.offer_type[portfolio.id == x['offer id']].values[0])
    if 'offer id' in x else json.dumps(x))
transcript
display(transcript.describe(include='all'))
print ()
display(transcript.info())
Note: the interactive profiling report below may not render inside the notebook. If it does not, open `transcript_report.html` (or the HTML export of this notebook) instead.
transcript_report = ProfileReport(transcript.loc[:, transcript.columns != 'value'], title="Transcript Exploratory Data Analysis Report", explorative=True)
transcript_report.to_file("transcript_report.html")
transcript_report
Looking at the exploratory data analysis of the portfolio dataset in section 1.1, the following cleaning steps are needed:

- rename `id` to `offer_id`
- rename `reward` to `offer_reward`
- rename `duration` to `offer_duration_days`
- rename `difficulty` to `offer_difficulty`
- one hot encode the `channels` column and replace it with the resulting indicator columns

def clean_transform_portfolio(portfolio):
"""
Function to clean and transform the portfolio
dataframe based on above requirements
Parameters
portfolio: portfolio dataframe
Returns
portfolio_cleaned: the cleaned portfolio dataframe
"""
# rename columns
portfolio = portfolio.rename(columns = {'id':'offer_id',
'reward':'offer_reward',
'duration':'offer_duration_days',
'difficulty':'offer_difficulty'})
# one hot encode channels
multi_label_binarizer = MultiLabelBinarizer()
multi_label_binarizer.fit(portfolio['channels'])
channels_encoded = pd.DataFrame(
multi_label_binarizer.transform(portfolio['channels']),
columns=multi_label_binarizer.classes_)
# remove offer_type and channels columns
portfolio = portfolio.drop(['channels'], axis=1)
# create final portfolio_cleaned dataframe by adding the encoded columns
portfolio_cleaned = pd.concat([portfolio, channels_encoded], axis=1)
# put all columns in a more representative order
columns_order = ['offer_id', 'offer_type', 'offer_duration_days', 'offer_difficulty',
'offer_reward', 'email', 'mobile', 'social', 'web']
return portfolio_cleaned[columns_order]
portfolio_cleaned = clean_transform_portfolio(portfolio)
portfolio_cleaned
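The `MultiLabelBinarizer` step above turns each list of channels into one indicator column per channel, with the classes sorted alphabetically. On a toy input:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

channels = [["email", "web"], ["email", "mobile", "social"]]
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(channels), columns=mlb.classes_)
print(encoded)
#    email  mobile  social  web
# 0      1       0       0    1
# 1      1       1       1    0
```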
Looking at the exploratory data analysis of the profile dataset in section 1.2, the following cleaning steps are needed:

- remove customers with missing values (age 118 acts as a missing-value placeholder)
- keep only the year from the `became_member_on` column
- rename `id` to `customer_id`
- rename `age` to `customer_age`
- rename `became_member_on` to `customer_registration_year`
- rename `income` to `customer_income`

print('Percentage of customers with at least one missing value:', round(1 - (profile.dropna().shape[0] / profile.shape[0]), 3) * 100, '%')
print('Percentage of customers with age 118 (the missing-value placeholder):', round(profile['income'][profile.age == 118].shape[0] / profile.shape[0], 3) * 100, '%')
def clean_transform_profile(profile):
    """
    Function to clean and transform the profile
    dataframe based on the above requirements
    Parameters
    ----------
    profile: profile dataframe
    Returns
    -------
    profile_cleaned: the cleaned profile dataframe
    """
    # remove any customer with missing values (12.8% of all customers have a missing value and age set to 118)
    profile_cleaned = profile.dropna().reset_index(drop=True)
    # keep only the year in which each customer became a member
    profile_cleaned['became_member_on'] = pd.DatetimeIndex(profile_cleaned.became_member_on).year
    # rename columns
    profile_cleaned = profile_cleaned.rename(columns={'became_member_on': 'customer_registration_year',
                                                      'income': 'customer_income',
                                                      'id': 'customer_id',
                                                      'age': 'customer_age',
                                                      'gender': 'customer_gender'})
    # put all columns in a more representative order
    columns_order = ['customer_id', 'customer_age', 'customer_gender',
                     'customer_income', 'customer_registration_year']
    return profile_cleaned[columns_order]
profile_cleaned = clean_transform_profile(profile)
profile_cleaned
Looking at the exploratory data analysis of the transcript dataset in section 1.3, the following cleaning steps are needed:

- rename the `offer id` key to `offer_id` inside the `value` dictionaries wherever it appears
- remove any customer that does not appear in the cleaned profile dataset
- rename `person` to `customer_id`
- rename `time` to `time_hours`

def clean_transcript(transcript, profile):
"""
Function to clean the transcript dataframe
based on above requirements
Parameters
transcript: transcript dataframe
profile: cleaned profile dataframe
Returns
transcript_cleaned: the cleaned transcript dataframe
"""
# remove value_eda column
transcript = transcript.drop(['value_eda'], axis=1)
# rename offer id to offer_id if applicable
transcript['value'] = transcript.value.apply(lambda x: {'offer_id': x['offer id']} if 'offer id' in x.keys() and 'reward' not in x.keys() else x)
# remove any customer that is not appear in the profile dataset
all_customers_list = list(profile['customer_id'])
transcript = transcript[transcript.person.apply(lambda x: x in all_customers_list)].reset_index()
# rename columns
transcript_cleaned = transcript.rename(columns = {'person':'customer_id',
'time':'time_hours',})
# put all columns in a more representative order
columns_order = ['customer_id', 'event', 'value', 'time_hours']
return transcript_cleaned[columns_order]
transcript_cleaned = clean_transcript(transcript, profile_cleaned)
transcript_cleaned
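An alternative to rewriting the `value` dictionaries by hand is to expand them into columns with `pd.json_normalize`; rows keep `NaN` for keys they lack. The notebook does not take this route, but it is worth knowing for this kind of dict-valued column. A toy sketch:

```python
import pandas as pd

# Toy value dicts of the three shapes seen in the transcript
values = [
    {"offer_id": "abc"},                 # offer received / viewed
    {"amount": 12.5},                    # transaction
    {"offer_id": "abc", "reward": 2},    # offer completed
]
expanded = pd.json_normalize(values)
print(expanded)
#   offer_id  amount  reward
# 0      abc     NaN     NaN
# 1      NaN    12.5     NaN
# 2      abc     NaN     2.0
```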
This is an important step, as the label of our dataset is created here.
The label is named successful_offer and is a binary feature with values 0 or 1, representing whether an offer was successful for this customer from the company's perspective.
A successful offer is one that follows the sequence below within the offer's validity window:

Offer Received -> Offer Viewed -> Offer Completed

def offer_summary(transcript, portfolio):
"""
Function to create a summary of all offers for each customer which will present
which offer was successful for each customer based on the above logic
Parameters
transcript: cleaned transcript dataframe
portfolio: cleaned portfolio dataframe
Returns
offer_summary: a dataframe with customer_id, offer_id and corresponding label
"""
offer_received = transcript_cleaned.query("event=='offer received'").reset_index(drop=True)
offer_received.value = offer_received.value.apply(lambda x: x['offer_id'])
offer_viewed = transcript_cleaned.query("event=='offer viewed'").reset_index(drop=True)
offer_viewed.value = offer_viewed.value.apply(lambda x: x['offer_id'])
offer_completed = transcript_cleaned.query("event=='offer completed'").reset_index(drop=True)
offer_completed.value = offer_completed.value.apply(lambda x: x['offer_id'])
offer_summary_list = list()
for index, received_row in offer_received.iterrows():
received_customer_id = received_row.customer_id
received_offer_id = received_row.value
received_time_hours = received_row.time_hours
offer_duration_hours = (portfolio_cleaned.query(f"offer_id=='{received_offer_id}'").offer_duration_days).values[0]*24
viewed_df = offer_viewed.query(f"time_hours>={received_time_hours} & time_hours<={received_time_hours+offer_duration_hours} & value=='{received_offer_id}' & customer_id=='{received_customer_id}'")
if not viewed_df.empty:
viewed_time_hours = viewed_df.iloc[0].time_hours
completed_df = offer_completed.query(f"time_hours>={viewed_time_hours} & time_hours<={received_time_hours+offer_duration_hours} & value=='{received_offer_id}' & customer_id=='{received_customer_id}'")
if not completed_df.empty:
successful_offer = 1
else:
successful_offer = 0
offer_summary_list.append([received_customer_id, received_offer_id, successful_offer])
return pd.DataFrame(offer_summary_list, columns=['customer_id', 'offer_id', 'successful_offer'])
# cache the (expensive) summary on disk; note this rebinds the name offer_summary to the dataframe
if os.path.isfile('offer_summary.csv'):
    offer_summary = pd.read_csv('offer_summary.csv')
else:
    offer_summary = offer_summary(transcript_cleaned, portfolio_cleaned)
    offer_summary.to_csv('offer_summary.csv', index=False)
offer_summary
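The labeling logic can be checked on a handful of toy events (hypothetical times and a made-up 24-hour validity window; here an unviewed offer simply keeps label 0 rather than being skipped):

```python
import pandas as pd

# Toy event log for one customer and one offer
events = pd.DataFrame({
    "event": ["offer received", "offer viewed", "offer completed"],
    "time_hours": [0, 6, 18],
})
duration_hours = 24

received = events.loc[events.event == "offer received", "time_hours"].iloc[0]
window_end = received + duration_hours
# a view must fall inside the validity window
viewed = events.query("event == 'offer viewed' and @received <= time_hours <= @window_end")
label = 0
if not viewed.empty:
    viewed_time = viewed.time_hours.iloc[0]
    # a completion must follow the view, still inside the window
    completed = events.query(
        "event == 'offer completed' and @viewed_time <= time_hours <= @window_end")
    label = int(not completed.empty)
print(label)  # 1: received -> viewed (6h) -> completed (18h), all within 24h
```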
def data_merge(portfolio, profile, offer_summary):
    """
    Creation of the final table by merging
    the final cleaned data frames
    Parameters
    ----------
    portfolio : cleaned and transformed portfolio data frame
    profile : cleaned and transformed profile data frame
    offer_summary : final offer_summary table as defined above
    Returns
    -------
    merged_df: merged data frame to be used in our problem
    """
    merged_df = pd.merge(offer_summary, profile, on='customer_id')
    merged_df = pd.merge(merged_df, portfolio, on='offer_id')
    columns_order = ['customer_age', 'customer_gender', 'customer_income', 'customer_registration_year',
                     'offer_type', 'offer_duration_days', 'offer_difficulty', 'offer_reward',
                     'email', 'mobile', 'social', 'web', 'successful_offer']
    return merged_df[columns_order]
merged_df = data_merge(portfolio_cleaned, profile_cleaned, offer_summary)
merged_df
display(merged_df.describe(include='all'))
print ()
display(merged_df.info())
Note: the interactive profiling report below may not render inside the notebook. If it does not, open `merged_df_report.html` (or the HTML export of this notebook) instead.
merged_df_report = ProfileReport(merged_df, title="Final Table Exploratory Data Analysis Report", explorative=True)
merged_df_report.to_file("merged_df_report.html")
merged_df_report
merged_df
sns.pairplot(merged_df[['customer_age', 'customer_income', 'successful_offer']], hue='successful_offer', height=5)
plt.show()
plt.figure(figsize=(12, 4))
sns.countplot(hue="successful_offer", x= "customer_gender", data=merged_df)
plt.show()
plt.figure(figsize=(12, 4))
sns.countplot(hue="successful_offer", x= "offer_type", data=merged_df)
plt.show()
plt.figure(figsize=(12, 4))
sns.countplot(hue="successful_offer", x= "offer_duration_days", data=merged_df)
plt.show()
plt.figure(figsize=(12, 4))
sns.countplot(hue="successful_offer", x= "offer_difficulty", data=merged_df)
plt.show()
plt.figure(figsize=(12, 4))
sns.countplot(hue="successful_offer", x= "offer_reward", data=merged_df)
plt.show()
plt.figure(figsize=(12, 4))
sns.countplot(hue="successful_offer", x= "customer_registration_year", data=merged_df)
plt.show()
Looking at the exploratory data analysis of the final dataset above, the following transformations are needed:

Categorical features to one hot encode:

- offer_type column
- customer_gender column
- customer_registration_year column

Numerical features to scale:

- customer_age column
- customer_income column
- offer_duration_days column
- offer_difficulty column
- offer_reward column

In addition:

- gender O accounts for less than 2% of the data, so its column should be dropped because it does not bring any value to the final model
- email appears in all offers, so 100% of its values are constant and the column should be dropped
- after dropping O, the gender M indicator is the complement of gender F, so one of the two has to be dropped; highly correlated features add no information and can hurt model performance

def transform_engineer_merged_df(merged_df):
"""
Function to clean and transform the final
dataframe based on above requirements
Parameters
merged_df: the final merged dataframe
Returns
final_df: the transformed transcript dataframe
"""
# one hot encode offer_type
offer_type_encoded = pd.get_dummies(merged_df['offer_type'])
# one hot encode offer_type
customer_gender_encoded = pd.get_dummies(merged_df['customer_gender'])
# one hot encode customer_registration_year
customer_registration_year_encoded = pd.get_dummies(merged_df['customer_registration_year'])
# create final profile_cleaned dataframe by adding the encoded columns
final_df = pd.concat([merged_df, offer_type_encoded, customer_gender_encoded, customer_registration_year_encoded], axis=1)
# scale numerical features
scaler = MinMaxScaler()
numerical_features = ['customer_age', 'customer_income', 'offer_duration_days', 'offer_difficulty', 'offer_reward']
final_df[numerical_features] = scaler.fit_transform(final_df[numerical_features])
# remove the columns based on above requirements and the encoded columns
final_df = final_df.drop(['offer_type', 'customer_gender', 'customer_registration_year', 'O', 'email', 'F'], axis=1)
# drop duplicates
final_df = final_df.drop_duplicates().reset_index(drop=True)
# order the resulted columns
columns_order = ['customer_age', 'M', 'customer_income'] + list(customer_registration_year_encoded.columns) + \
['bogo', 'discount', 'informational', 'offer_duration_days', 'offer_difficulty', 'offer_reward',
'mobile', 'social', 'web', 'successful_offer']
return final_df[columns_order]
final_df = transform_engineer_merged_df(merged_df)
final_df
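One caveat worth noting: `transform_engineer_merged_df` fits the `MinMaxScaler` on the full dataset before the train/test split, so the test rows influence the scaling parameters. A common alternative is to fit the scaler on the training split only and reuse it for the other splits; a minimal sketch on toy data (not the notebook's features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = MinMaxScaler().fit(X_train)        # statistics from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # same parameters reused, no refit
print(X_train_scaled.min(), X_train_scaled.max())  # 0.0 1.0
```

With min-max scaling on this dataset the practical difference is small, but the fit-on-train pattern generalizes to any preprocessing step that learns parameters from data.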
In this step we will split the data into three different sets (training, validation & test sets) and balance the training and validation sets (equal numbers of successful and unsuccessful offers per offer type).

class data_split():
"""
Class to split the data in training and test set
with the option to balance the dataset as well
Parameters
df: the final merged dataframe
label: the target label
Returns
X_train, X_test, y_train, y_test: the training and test sets
"""
def __init__(self, df, label):
self.X = df.loc[:, df.columns != label]
self.y = df[label]
def balance_data(self, X, y):
X_bogo = X.query('bogo==1')
X_discount = X.query('discount==1')
X_informational = X.query('informational==1')
index_0 = list()
index_1 = list()
for temp_X in [X_bogo, X_discount, X_informational]:
temp_y = y.loc[list(temp_X.index)]
count_0, count_1 = temp_y.value_counts()[0], temp_y.value_counts()[1]
if count_1>count_0:
index_0.append(list(temp_y[temp_y==0].index))
index_1.append(list(temp_y[temp_y==1].sample(n=count_0, random_state=42).index))
else:
index_1.append(list(temp_y[temp_y==1].index))
index_0.append(list(temp_y[temp_y==0].sample(n=count_1, random_state=42).index))
X = X.loc[[j for sub in index_0+index_1 for j in sub]]
y = y.loc[[j for sub in index_0+index_1 for j in sub]]
return X, y
def train_test_val_split(self):
X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
X_train, y_train = self.balance_data(X_train, y_train)
X_val, y_val = self.balance_data(X_val, y_val)
return X_train, X_val, X_test, y_train, y_val, y_test
ds = data_split(final_df, 'successful_offer')
X_train, X_val, X_test, y_train, y_val, y_test = ds.train_test_val_split()
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape
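The undersampling inside `balance_data` can be sanity-checked on a toy label series: all minority-class rows are kept and the majority class is sampled down to match.

```python
import pandas as pd

y = pd.Series([1, 1, 1, 1, 0, 0])   # 4 positives, 2 negatives
count_0, count_1 = y.value_counts()[0], y.value_counts()[1]
minority = min(count_0, count_1)

# Keep all minority-class rows; downsample the majority class to the minority count
idx_0 = y[y == 0].index if count_0 <= count_1 else y[y == 0].sample(n=minority, random_state=42).index
idx_1 = y[y == 1].index if count_1 <= count_0 else y[y == 1].sample(n=minority, random_state=42).index
y_balanced = y.loc[list(idx_0) + list(idx_1)]
print(y_balanced.value_counts())   # 2 rows of each class
```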
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
clf.predict_proba(X_test)
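Since the training set is balanced but the test set is not, accuracy alone can mislead; a confusion matrix and per-class scores give a fuller picture. A toy sketch with hypothetical labels and predictions:

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 3]]

# F1 balances precision and recall for the positive class
print(f1_score(y_true, y_pred))  # 0.75
```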
Prepare the training & test sets
X_train_bogo = X_train.query("bogo==1").drop(columns=['bogo', 'discount', 'informational'])
y_train_bogo = y_train.loc[X_train_bogo.index]
X_test_bogo = X_test.query("bogo==1").drop(columns=['bogo', 'discount', 'informational'])
y_test_bogo = y_test.loc[X_test_bogo.index]
X_train_bogo.shape, y_train_bogo.shape, X_test_bogo.shape, y_test_bogo.shape
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train_bogo, y_train_bogo)
y_pred = clf.predict(X_test_bogo)
accuracy_score(y_test_bogo, y_pred)
Prepare validation sets for XGBoost
X_val_bogo = X_val.query("bogo==1").drop(columns=['bogo', 'discount', 'informational'])
y_val_bogo = y_val.loc[X_val_bogo.index]
X_val_bogo.shape, y_val_bogo.shape
Prepare the training & test sets
X_train_discount = X_train.query("discount==1").drop(columns=['bogo', 'discount', 'informational'])
y_train_discount = y_train.loc[X_train_discount.index]
X_test_discount = X_test.query("discount==1").drop(columns=['bogo', 'discount', 'informational'])
y_test_discount = y_test.loc[X_test_discount.index]
X_train_discount.shape, y_train_discount.shape, X_test_discount.shape, y_test_discount.shape
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train_discount, y_train_discount)
y_pred = clf.predict(X_test_discount)
accuracy_score(y_test_discount, y_pred)
Prepare validation sets for XGBoost
X_val_discount = X_val.query("discount==1").drop(columns=['bogo', 'discount', 'informational'])
y_val_discount = y_val.loc[X_val_discount.index]
X_val_discount.shape, y_val_discount.shape
Prepare the training & test sets
X_train_informational = X_train.query("informational==1").drop(columns=['bogo', 'discount', 'informational'])
y_train_informational = y_train.loc[X_train_informational.index]
X_test_informational = X_test.query("informational==1").drop(columns=['bogo', 'discount', 'informational'])
y_test_informational = y_test.loc[X_test_informational.index]
X_train_informational.shape, y_train_informational.shape, X_test_informational.shape, y_test_informational.shape
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train_informational, y_train_informational)
y_pred = clf.predict(X_test_informational)
accuracy_score(y_test_informational, y_pred)
Prepare validation sets for XGBoost
X_val_informational = X_val.query("informational==1").drop(columns=['bogo', 'discount', 'informational'])
y_val_informational = y_val.loc[X_val_informational.index]
X_val_informational.shape, y_val_informational.shape
XGBoost models will be created and used on AWS
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer
# this is an object that represents the SageMaker session that we are currently operating in. This
# object contains some useful information that we will need to access later such as our region.
session = sagemaker.Session()
# this is an object that represents the IAM role that we are currently assigned. When we construct
# and launch the training job later we will need to tell it what IAM role it should have. Since our
# use case is relatively simple we will simply assign the training job the role we currently have.
role = get_execution_role()
When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.
First we need to create the test, train and validation csv files which we will then upload to S3.
# this is our local data directory. We need to make sure that it and its subdirectories exist,
# since pandas will not create missing directories when writing the csv files below.
data_dir = '../Starbucks-Capstone-Project/data'
for subdir in ['all_offers', 'bogo', 'discount', 'informational']:
    os.makedirs(os.path.join(data_dir, subdir), exist_ok=True)
# we use pandas to save our test, train and validation data to csv files. Note that we make sure not to include header
# information or an index as this is required by the built in algorithms provided by Amazon. Also, for the train and
# validation data, it is assumed that the first entry in each row is the target variable.
X_test.to_csv(os.path.join(data_dir, 'all_offers', 'test.csv'), header=False, index=False)
pd.concat([y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'all_offers', 'validation.csv'), header=False, index=False)
pd.concat([y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'all_offers', 'train.csv'), header=False, index=False)
X_test_bogo.to_csv(os.path.join(data_dir, 'bogo', 'test_bogo.csv'), header=False, index=False)
pd.concat([y_val_bogo, X_val_bogo], axis=1).to_csv(os.path.join(data_dir, 'bogo', 'validation_bogo.csv'), header=False, index=False)
pd.concat([y_train_bogo, X_train_bogo], axis=1).to_csv(os.path.join(data_dir, 'bogo', 'train_bogo.csv'), header=False, index=False)
X_test_discount.to_csv(os.path.join(data_dir, 'discount', 'test_discount.csv'), header=False, index=False)
pd.concat([y_val_discount, X_val_discount], axis=1).to_csv(os.path.join(data_dir, 'discount', 'validation_discount.csv'), header=False, index=False)
pd.concat([y_train_discount, X_train_discount], axis=1).to_csv(os.path.join(data_dir, 'discount', 'train_discount.csv'), header=False, index=False)
X_test_informational.to_csv(os.path.join(data_dir, 'informational', 'test_informational.csv'), header=False, index=False)
pd.concat([y_val_informational, X_val_informational], axis=1).to_csv(os.path.join(data_dir, 'informational', 'validation_informational.csv'), header=False, index=False)
pd.concat([y_train_informational, X_train_informational], axis=1).to_csv(os.path.join(data_dir, 'informational', 'train_informational.csv'), header=False, index=False)
Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project.
prefix = 'starbucks-xgboost'
test_location = session.upload_data(os.path.join(data_dir, 'all_offers', 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'all_offers', 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'all_offers', 'train.csv'), key_prefix=prefix)
test_location_bogo = session.upload_data(os.path.join(data_dir, 'bogo', 'test_bogo.csv'), key_prefix=prefix)
val_location_bogo = session.upload_data(os.path.join(data_dir, 'bogo', 'validation_bogo.csv'), key_prefix=prefix)
train_location_bogo = session.upload_data(os.path.join(data_dir, 'bogo', 'train_bogo.csv'), key_prefix=prefix)
test_location_discount = session.upload_data(os.path.join(data_dir, 'discount', 'test_discount.csv'), key_prefix=prefix)
val_location_discount = session.upload_data(os.path.join(data_dir, 'discount', 'validation_discount.csv'), key_prefix=prefix)
train_location_discount = session.upload_data(os.path.join(data_dir, 'discount', 'train_discount.csv'), key_prefix=prefix)
test_location_informational = session.upload_data(os.path.join(data_dir, 'informational', 'test_informational.csv'), key_prefix=prefix)
val_location_informational = session.upload_data(os.path.join(data_dir, 'informational', 'validation_informational.csv'), key_prefix=prefix)
train_location_informational = session.upload_data(os.path.join(data_dir, 'informational', 'train_informational.csv'), key_prefix=prefix)
Now that we have the training and validation data uploaded to S3, we can construct our XGBoost models and train them. Instead of training a single model, we will use SageMaker's hyperparameter tuning functionality to train multiple models and use the one that performs the best on the validation set.
To begin with, we will need to construct an estimator object.
# as stated above, we use this utility method to construct the image name for the training container.
container = get_image_uri(session.boto_region_name, 'xgboost')
# now that we know which container to use, we can construct the estimator object.
xgb = sagemaker.estimator.Estimator(container, # The name of the training container
role, # The IAM role to use (our current role in this case)
train_instance_count=1, # The number of instances to use for training
train_instance_type='ml.m4.xlarge', # The type of instance ot use for training
output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
# Where to save the output (the model artifacts)
sagemaker_session=session) # The current SageMaker session
Before beginning the hyperparameter tuning, we should set any model-specific hyperparameters that we want to keep at fixed default values. Quite a few can be set when using the XGBoost algorithm; below are just a few of them. If you would like to change the hyperparameters below or modify additional ones, you can find more information on the XGBoost hyperparameters page.
xgb.set_hyperparameters(max_depth=5,
eta=0.2,
gamma=4,
min_child_weight=6,
subsample=0.8,
objective='binary:logistic',
early_stopping_rounds=10,
num_round=200)
Now that we have our estimator object completely set up, it is time to create the hyperparameter tuner. To do this we need to construct a new object which contains each of the parameters we want SageMaker to tune. In this case, we wish to find the best values for the max_depth, eta, min_child_weight, subsample, and gamma parameters. Note that for each parameter that we want SageMaker to tune we need to specify both the type of the parameter and the range of values that parameter may take on.
In addition, we specify the number of models to construct (max_jobs) and the number of those that can be trained in parallel (max_parallel_jobs). In the cell below we have chosen to train 20 models, of which we ask that SageMaker train 3 at a time in parallel. Note that this results in a total of 20 training jobs being executed which can take some time, in this case almost a half hour. With more complicated models this can take even longer so be aware!
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
xgb_hyperparameter_tuner = HyperparameterTuner(
    estimator=xgb,  # The estimator object to use as the basis for the training jobs
    objective_metric_name='validation:error',  # The metric used to compare trained models;
                                               # a classification metric, matching the binary:logistic objective
    objective_type='Minimize',  # Whether we wish to minimize or maximize the metric
    max_jobs=20,  # The total number of models to train
    max_parallel_jobs=3,  # The number of models to train in parallel
    hyperparameter_ranges={
        'max_depth': IntegerParameter(3, 12),
        'eta': ContinuousParameter(0.05, 0.5),
        'min_child_weight': IntegerParameter(2, 8),
        'subsample': ContinuousParameter(0.5, 0.9),
        'gamma': ContinuousParameter(0, 10),
    })
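The same search idea can be prototyped locally before paying for SageMaker training jobs: scikit-learn's `RandomizedSearchCV` samples from comparable parameter ranges and scores each candidate by cross-validation. A sketch on synthetic data (not the notebook's features), mirroring the 20-candidate budget above:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # synthetic binary target

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions={"max_depth": range(3, 13),        # like IntegerParameter(3, 12)
                         "min_samples_leaf": range(2, 9)},
    n_iter=20,     # 20 candidate models, like max_jobs above
    cv=3,
    random_state=42)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```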
Now that we have our hyperparameter tuner object completely set up, it is time to train it. To do this we make sure that SageMaker knows our input data is in csv format and then execute the fit method.
# this is a wrapper around the location of our train and validation data, to make sure that SageMaker
# knows our data is in csv format.
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')
xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
As in many of the examples we have seen so far, the fit() method takes care of setting up and fitting a number of different models, each with different hyperparameters. If we wish to wait for this process to finish, we can call the wait() method.
xgb_hyperparameter_tuner.wait()
Once the hyperparameter tuner has finished, we can retrieve information about the best performing model.
xgb_hyperparameter_tuner.best_training_job()
In addition, since we'd like to set up a batch transform job to test the best model, we can construct a new estimator object from the results of the best training job. The xgb_attached object below can now be used as though we constructed an estimator with the best performing hyperparameters and then fit it to our training data.
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())
Testing the model
Now that we have our best performing model, we can test it. To do this we will use the batch transform functionality. To start with, we need to build a transformer object from our fit model.
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
Next we ask SageMaker to begin a batch transform job using our trained model, applying it to the test data we previously stored in S3. We need to tell SageMaker the type of data we are providing, in our case text/csv, so that it knows how to interpret it. In addition, we need to let SageMaker know how to split our data up into chunks in case the entire data set is too large to send to our model all at once.
Note that when we ask SageMaker to do this it will execute the batch transform job in the background. Since we need to wait for the results of this job before we can continue, we use the wait() method. An added benefit of this is that we get some output from our batch transform job which lets us know if anything went wrong.
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
Currently the transform job is running but it is doing so in the background. Since we wish to wait until the transform job is done and we would like a bit of feedback we can run the wait() method.
xgb_transformer.wait()
Now the transform job has executed and the result, the predicted probability of offer completion for each test record, has been saved on S3. Since we would rather work on this file locally we can perform a bit of notebook magic to copy the file to the data_dir.
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/all_offers'
The last step is to read in the output from our model, convert it to something a little more usable, in this case a hard label of 1 (offer completed) or 0 (not completed), and then compare it to the ground truth labels.
predictions = pd.read_csv(os.path.join(data_dir, 'all_offers', 'test.csv.out'), header=None)
y_pred = [round(num) for num in predictions.squeeze().values]
accuracy_score(y_test, y_pred)
predictions.values
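Accuracy alone can be misleading when the completed and not-completed classes are imbalanced. As a quick sanity check, a confusion matrix and per-class precision/recall can be computed from the same rounded predictions. The arrays below are illustrative stand-ins for the real y_test and y_pred above:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative stand-ins for the real y_test / y_pred arrays above
y_true_example = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_example = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the true labels, columns the predicted labels
print(confusion_matrix(y_true_example, y_pred_example))

# Precision, recall and f1-score broken out per class
print(classification_report(y_true_example, y_pred_example))
```

On the real data, substituting y_test and y_pred gives a clearer picture of whether the model favors one class over the other than the single accuracy number.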
Same steps as above
# as stated above, we use this utility method to construct the image name for the training container.
container = get_image_uri(session.boto_region_name, 'xgboost')
# now that we know which container to use, we can construct the estimator object.
xgb = sagemaker.estimator.Estimator(container, # The name of the training container
                                    role, # The IAM role to use (our current role in this case)
                                    train_instance_count=1, # The number of instances to use for training
                                    train_instance_type='ml.m4.xlarge', # The type of instance to use for training
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    # Where to save the output (the model artifacts)
                                    sagemaker_session=session) # The current SageMaker session
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=200)
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
xgb_hyperparameter_tuner = HyperparameterTuner(estimator=xgb, # The estimator object to use as the basis for the training jobs.
                                               objective_metric_name='validation:rmse', # The metric used to compare trained models.
                                               objective_type='Minimize', # Whether we wish to minimize or maximize the metric.
                                               max_jobs=20, # The total number of models to train
                                               max_parallel_jobs=3, # The number of models to train in parallel
                                               hyperparameter_ranges={
                                                   'max_depth': IntegerParameter(3, 12),
                                                   'eta': ContinuousParameter(0.05, 0.5),
                                                   'min_child_weight': IntegerParameter(2, 8),
                                                   'subsample': ContinuousParameter(0.5, 0.9),
                                                   'gamma': ContinuousParameter(0, 10),
                                               })
# This is a wrapper around the location of our train and validation data, to make sure that SageMaker
# knows our data is in csv format.
s3_input_train = sagemaker.s3_input(s3_data=train_location_bogo, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location_bogo, content_type='csv')
xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
xgb_hyperparameter_tuner.wait()
xgb_hyperparameter_tuner.best_training_job()
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())
Testing the model
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(test_location_bogo, content_type='text/csv', split_type='Line')
xgb_transformer.wait()
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/bogo'
predictions = pd.read_csv(os.path.join(data_dir, 'bogo', 'test_bogo.csv.out'), header=None)
y_pred = [round(num) for num in predictions.squeeze().values]
accuracy_score(y_test_bogo, y_pred)
predictions.values
Same steps as above
# as stated above, we use this utility method to construct the image name for the training container.
container = get_image_uri(session.boto_region_name, 'xgboost')
# now that we know which container to use, we can construct the estimator object.
xgb = sagemaker.estimator.Estimator(container, # The name of the training container
                                    role, # The IAM role to use (our current role in this case)
                                    train_instance_count=1, # The number of instances to use for training
                                    train_instance_type='ml.m4.xlarge', # The type of instance to use for training
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    # Where to save the output (the model artifacts)
                                    sagemaker_session=session) # The current SageMaker session
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=200)
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
xgb_hyperparameter_tuner = HyperparameterTuner(estimator=xgb, # The estimator object to use as the basis for the training jobs.
                                               objective_metric_name='validation:rmse', # The metric used to compare trained models.
                                               objective_type='Minimize', # Whether we wish to minimize or maximize the metric.
                                               max_jobs=20, # The total number of models to train
                                               max_parallel_jobs=3, # The number of models to train in parallel
                                               hyperparameter_ranges={
                                                   'max_depth': IntegerParameter(3, 12),
                                                   'eta': ContinuousParameter(0.05, 0.5),
                                                   'min_child_weight': IntegerParameter(2, 8),
                                                   'subsample': ContinuousParameter(0.5, 0.9),
                                                   'gamma': ContinuousParameter(0, 10),
                                               })
# This is a wrapper around the location of our train and validation data, to make sure that SageMaker
# knows our data is in csv format.
s3_input_train = sagemaker.s3_input(s3_data=train_location_discount, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location_discount, content_type='csv')
xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
xgb_hyperparameter_tuner.wait()
Once the hyperparameter tuner has finished, we can retrieve information about the best performing model.
xgb_hyperparameter_tuner.best_training_job()
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())
Testing the model
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(test_location_discount, content_type='text/csv', split_type='Line')
xgb_transformer.wait()
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/discount'
predictions = pd.read_csv(os.path.join(data_dir, 'discount', 'test_discount.csv.out'), header=None)
y_pred = [round(num) for num in predictions.squeeze().values]
accuracy_score(y_test_discount, y_pred)
predictions.values
Same steps as above
# as stated above, we use this utility method to construct the image name for the training container.
container = get_image_uri(session.boto_region_name, 'xgboost')
# now that we know which container to use, we can construct the estimator object.
xgb = sagemaker.estimator.Estimator(container, # The name of the training container
                                    role, # The IAM role to use (our current role in this case)
                                    train_instance_count=1, # The number of instances to use for training
                                    train_instance_type='ml.m4.xlarge', # The type of instance to use for training
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    # Where to save the output (the model artifacts)
                                    sagemaker_session=session) # The current SageMaker session
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=200)
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
xgb_hyperparameter_tuner = HyperparameterTuner(estimator=xgb, # The estimator object to use as the basis for the training jobs.
                                               objective_metric_name='validation:rmse', # The metric used to compare trained models.
                                               objective_type='Minimize', # Whether we wish to minimize or maximize the metric.
                                               max_jobs=20, # The total number of models to train
                                               max_parallel_jobs=3, # The number of models to train in parallel
                                               hyperparameter_ranges={
                                                   'max_depth': IntegerParameter(3, 12),
                                                   'eta': ContinuousParameter(0.05, 0.5),
                                                   'min_child_weight': IntegerParameter(2, 8),
                                                   'subsample': ContinuousParameter(0.5, 0.9),
                                                   'gamma': ContinuousParameter(0, 10),
                                               })
# This is a wrapper around the location of our train and validation data, to make sure that SageMaker
# knows our data is in csv format.
s3_input_train = sagemaker.s3_input(s3_data=train_location_informational, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location_informational, content_type='csv')
xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
xgb_hyperparameter_tuner.wait()
xgb_hyperparameter_tuner.best_training_job()
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())
Testing the model
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(test_location_informational, content_type='text/csv', split_type='Line')
xgb_transformer.wait()
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/informational'
predictions = pd.read_csv(os.path.join(data_dir, 'informational', 'test_informational.csv.out'), header=None)
y_pred = [round(num) for num in predictions.squeeze().values]
accuracy_score(y_test_informational, y_pred)
predictions.values
X_test_use_case = X_test.sample(n=100, random_state=9)
X_test_use_case_no_offer_info = X_test_use_case.drop(columns=['bogo', 'discount', 'informational'])
y_test_use_case = y_test.loc[X_test_use_case.index]
X_test_use_case.shape, X_test_use_case_no_offer_info.shape, y_test_use_case.shape
When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.
First we need to create the test csv files which we will then upload to S3.
# make sure the local directory exists before writing the csv files
os.makedirs(os.path.join(data_dir, 'use_case'), exist_ok=True)
X_test_use_case.to_csv(os.path.join(data_dir, 'use_case', 'test_use_case.csv'), header=False, index=False)
X_test_use_case_no_offer_info.to_csv(os.path.join(data_dir, 'use_case', 'test_use_case_no_offer_info.csv'), header=False, index=False)
Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project.
prefix = 'starbucks-xgboost'
test_location_use_case = session.upload_data(os.path.join(data_dir, 'use_case', 'test_use_case.csv'), key_prefix=prefix)
test_location_use_case_no_offer_info = session.upload_data(os.path.join(data_dir, 'use_case', 'test_use_case_no_offer_info.csv'), key_prefix=prefix)
xgb_attached = sagemaker.estimator.Estimator.attach('xgboost-220212-2140-009-fbc78020') # the best training job name, taken from section 6.3.1
Testing the model
Now that we have our best performing model, we can test it. To do this we will use the batch transform functionality. To start with, we need to build a transformer object from our fit model.
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
Next we ask SageMaker to begin a batch transform job using our trained model, applying it to the test data we previously stored in S3. We need to tell SageMaker the type of data we are providing, in our case text/csv, so that it knows how to interpret it. In addition, we need to let SageMaker know how to split our data up into chunks in case the entire data set is too large to send to our model all at once.
Note that when we ask SageMaker to do this it will execute the batch transform job in the background. Since we need to wait for the results of this job before we can continue, we use the wait() method. An added benefit of this is that we get some output from our batch transform job which lets us know if anything went wrong.
xgb_transformer.transform(test_location_use_case, content_type='text/csv', split_type='Line')
Currently the transform job is running but it is doing so in the background. Since we wish to wait until the transform job is done and we would like a bit of feedback we can run the wait() method.
xgb_transformer.wait()
Now the transform job has executed and the result, the predicted probability of offer completion for each customer in our use-case sample, has been saved on S3. Since we would rather work on this file locally we can perform a bit of notebook magic to copy the file to the data_dir.
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/all_offers'
The last step is to read in the output from our model, convert it to something a little more usable, in this case a hard label of 1 (offer completed) or 0 (not completed), and then compare it to the ground truth labels.
predictions_all_offers = pd.read_csv(os.path.join(data_dir, 'all_offers', 'test_use_case.csv.out'), header=None)
y_pred = [round(num) for num in predictions_all_offers.squeeze().values]
accuracy_score(y_test_use_case, y_pred)
predictions_all_offers.values
xgb_attached = sagemaker.estimator.Estimator.attach('xgboost-220212-2217-010-2fb6076e') # the best training job name, taken from section 6.3.2
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(test_location_use_case_no_offer_info, content_type='text/csv', split_type='Line')
xgb_transformer.wait()
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/bogo'
predictions_bogo = pd.read_csv(os.path.join(data_dir, 'bogo', 'test_use_case_no_offer_info.csv.out'), header=None)
predictions_bogo.values
xgb_attached = sagemaker.estimator.Estimator.attach('xgboost-220212-2350-004-7621399f') # the best training job name, taken from section 6.3.3
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(test_location_use_case_no_offer_info, content_type='text/csv', split_type='Line')
xgb_transformer.wait()
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/discount'
predictions_discount = pd.read_csv(os.path.join(data_dir, 'discount', 'test_use_case_no_offer_info.csv.out'), header=None)
predictions_discount.values
xgb_attached = sagemaker.estimator.Estimator.attach('xgboost-220213-0029-020-84539665') # the best training job name, taken from section 6.3.4
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(test_location_use_case_no_offer_info, content_type='text/csv', split_type='Line')
xgb_transformer.wait()
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir'/informational'
predictions_informational = pd.read_csv(os.path.join(data_dir, 'informational', 'test_use_case_no_offer_info.csv.out'), header=None)
predictions_informational.values
Now that we have the results from all four models for each customer, we can produce an output that helps the marketing team (or any similar team) make a final decision about the most suitable offer for each customer.
The same logic can be used in day-to-day business, whether for many customers or just one.
for i in range(X_test_use_case.shape[0]):
    print(f"{'-'*50}Customer with index {i} {'-'*50}")
    customer = X_test_use_case.iloc[i]
    if customer.bogo == 1:
        print(f"If the customer receives only a bogo offer, the likelihood of successfully completing the offer is: {round(predictions_all_offers.iloc[i][0], 2)}")
    elif customer.discount == 1:
        print(f"If the customer receives only a discount offer, the likelihood of successfully completing the offer is: {round(predictions_all_offers.iloc[i][0], 2)}")
    else:
        print(f"If the customer receives only an informational offer, the likelihood of successfully completing the offer is: {round(predictions_all_offers.iloc[i][0], 2)}")
    bogo_prediction = round(predictions_bogo.iloc[i][0]*100, 2)
    discount_prediction = round(predictions_discount.iloc[i][0]*100, 2)
    informational_prediction = round(predictions_informational.iloc[i][0]*100, 2)
    print(f"This customer is suitable for a bogo offer with a confidence of: {bogo_prediction}%")
    print(f"This customer is suitable for a discount offer with a confidence of: {discount_prediction}%")
    print(f"This customer is suitable for an informational offer with a confidence of: {informational_prediction}%")
    print('')
    if bogo_prediction > discount_prediction and bogo_prediction > informational_prediction:
        if bogo_prediction < 50:
            print(f"{'*'*20} The most suitable offer for this customer is a BOGO offer, BUT THE LIKELIHOOD OF COMPLETING IS BELOW 50%. BETTER TO SEND AN INFORMATIONAL OFFER ONLY {'*'*20} ")
        else:
            print(f"{'*'*20} The most suitable offer for this customer is a BOGO offer {'*'*20} ")
    elif discount_prediction > bogo_prediction and discount_prediction > informational_prediction:
        if discount_prediction < 50:
            print(f"{'*'*20} The most suitable offer for this customer is a DISCOUNT offer, BUT THE LIKELIHOOD OF COMPLETING IS BELOW 50%. BETTER TO SEND AN INFORMATIONAL OFFER ONLY {'*'*20} ")
        else:
            print(f"{'*'*20} The most suitable offer for this customer is a DISCOUNT offer {'*'*20} ")
    else:
        print(f"{'*'*20} The most suitable offer for this customer is an INFORMATIONAL offer {'*'*20} ")
    print('')
    print('')
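The decision rule inside the loop above can also be factored into a small helper, which makes it easier to reuse for a single customer or across a batch. This is a minimal sketch; the function name is hypothetical and the 50% threshold mirrors the rule used in the loop:

```python
def most_suitable_offer(bogo_pct, discount_pct, informational_pct, threshold=50.0):
    """Pick the offer type with the highest model confidence (in percent),
    but fall back to an informational offer when the best paid offer
    (bogo or discount) is below the completion-likelihood threshold."""
    best = max(('bogo', bogo_pct),
               ('discount', discount_pct),
               ('informational', informational_pct),
               key=lambda t: t[1])
    if best[0] in ('bogo', 'discount') and best[1] < threshold:
        return 'informational'
    return best[0]

print(most_suitable_offer(72.5, 40.0, 55.0))  # bogo wins and clears the 50% threshold
print(most_suitable_offer(42.0, 38.0, 30.0))  # best paid offer is below 50%, so fall back
```

Feeding in the three per-customer confidences from predictions_bogo, predictions_discount, and predictions_informational reproduces the recommendations printed by the loop (ties between the two paid offers are resolved in favor of bogo here, a detail the original loop leaves to its else branch).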